Fault-tolerant Distributed Applications In LiPS
Abstract
Performing computations using networks of workstations is increasingly becoming an alternative to using a supercomputer. This approach is motivated by the vast quantities of unused idle time available in workstation networks. Unlike a tightly coupled parallel computer, where a fixed number of processor nodes is used within a computation, a workstation network has a number of usable nodes that changes constantly over time. Additionally, workstations are more frequently subject to outages, e.g. due to reboots. The question arises how applications that adapt smoothly to this environment should be realized. This paper shows how fault-tolerant distributed applications are implemented within LiPS version 2.4, a system for distributed computing using idle cycles in networks of workstations. This system is currently used at the Universität des Saarlandes in Saarbrücken to perform computationally intensive applications on a net of approximately 250 workstations. The LiPS system (Library for Parallel Systems) employs the tuple space programming paradigm, as originally used in the Linda programming language. Applications implemented using this paradigm easily adapt to changes in availability as they occur in workstation networks. Applications are enabled to terminate successfully in spite of failing nodes by periodically writing checkpoints, which freeze the state of a computational process in a file, and by keeping track of the messages exchanged between checkpoints in a message log. Because message logging is integrated with the tuple space, an individual process may restart from its latest checkpoint while all other processes continue to run unaffected. This also alleviates the need for application-wide synchronisation in order to generate a set of consistent checkpoints.
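The interplay of checkpoints and a message log described in the abstract can be illustrated with a small sketch. It is not the LiPS API: the class names `LoggedTupleSpace` and `SumWorker`, the method names, and the in-memory single-process setup are all assumptions made for illustration. The idea shown is that every tuple a process withdraws is logged until its next checkpoint, so a restarted process can consume the same tuples again without involving any other process.

```python
import os
import pickle
import tempfile


class LoggedTupleSpace:
    """Toy in-memory tuple space (hypothetical, not the LiPS interface)."""

    def __init__(self):
        self.tuples = []  # tuples currently in the space
        self.log = {}     # process id -> tuples withdrawn since its last checkpoint

    def out(self, tup):
        """Deposit a tuple into the space."""
        self.tuples.append(tup)

    def take(self, pid, tag):
        """Withdraw the first tuple whose first field equals tag, or return None."""
        for i, tup in enumerate(self.tuples):
            if tup[0] == tag:
                del self.tuples[i]
                self.log.setdefault(pid, []).append(tup)  # message logging
                return tup
        return None

    def checkpoint_taken(self, pid):
        """The state of pid is on disk, so its log entries are no longer needed."""
        self.log[pid] = []

    def replay(self, pid):
        """Return and clear the tuples pid withdrew since its last checkpoint."""
        return self.log.pop(pid, [])


class SumWorker:
    """Worker that sums the payloads of ("task", n) tuples."""

    def __init__(self, pid, space, path):
        self.pid, self.space, self.path = pid, space, path
        self.total = 0

    def step(self):
        """Process one task tuple; return False when none is available."""
        tup = self.space.take(self.pid, "task")
        if tup is None:
            return False
        self.total += tup[1]
        return True

    def checkpoint(self):
        """Freeze the worker state in a file, then truncate the message log."""
        with open(self.path, "wb") as f:
            pickle.dump(self.total, f)
        self.space.checkpoint_taken(self.pid)

    @classmethod
    def restart(cls, pid, space, path):
        """Rebuild the worker from its checkpoint and re-deposit logged tuples."""
        worker = cls(pid, space, path)
        with open(path, "rb") as f:
            worker.total = pickle.load(f)
        space.tuples[:0] = space.replay(pid)  # put them back at the front
        return worker


def demo():
    space = LoggedTupleSpace()
    for n in [1, 2, 3, 4, 5]:
        space.out(("task", n))
    path = os.path.join(tempfile.mkdtemp(), "worker.ckpt")

    worker = SumWorker("w1", space, path)
    worker.step()
    worker.step()
    worker.checkpoint()  # state after tasks 1 and 2 is saved
    worker.step()        # task 3 is consumed and logged, then the node "fails"
    del worker

    worker = SumWorker.restart("w1", space, path)
    while worker.step():
        pass
    return worker.total


if __name__ == "__main__":
    print(demo())
```

In this sketch the restart touches only the failed worker's own log and checkpoint file, which mirrors the claim that no application-wide synchronisation is required. Tuples that a worker deposits between checkpoints would need additional care to avoid duplicates; that is left out here.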
Similar resources
Software Fault Tolerant Distributed Applications in LiPS
This paper illustrates how software fault tolerant distributed applications are implemented within LiPS, a system for distributed computing using idle cycles in networks of workstations. The LiPS system employs the tuple space programming paradigm, as originally used in the Linda programming language. Applications implemented using this paradigm easily adapt to c...
First Steps in the Implementation of a Fault-Tolerant
Transis [ADKM92, AAD93, ADM+93] is a tool for group communication that provides reliable ordered multicast along with membership services and strong group semantics. Transis can currently be used by processes residing on nodes within a BCD (Broadcast Domain). Building distributed applications on top of these services enables the programmer to assume ordering constraints on message delivery even ...
The LiPS Runtime Systems based on Fault-Tolerant Tuple Space Machines
Performing computation using networks of workstations is increasingly becoming an alternative to using a supercomputer. This approach is motivated by the vast quantities of unused idle time available in workstation networks. Unlike computing on a tightly coupled parallel computer, where a fixed number of processor nodes is used within a computation, the number of usable nodes in a workstation netw...
Design, Implementation and Performance of a Mutex-Token based Fault-Tolerant Tuple
LiPS is a system for distributed computing using idle cycles in networks of workstations. In its version 2.3, it is currently used at the Universität des Saarlandes in Saarbrücken, Germany to perform computationally intensive applications in the field of cryptography and computer algebra on a net of approximately 250 workstations. It should be enhanced to work on more than 1000 machines all...
Channel Reification: a Reflective Approach to Fault-tolerant Software Development
Reflective systems can be used to ease the implementation of fault tolerance mechanisms in distributed applications, as shown in [Anc95, Fab94]. In this paper we introduce a new model for reflective computations, and we show how it can be used for building fault tolerant applications.
Publication date: 1996